Learning Language Representations for Typology Prediction
One central mystery of neural NLP is what neural models "know" about their
subject matter. When a neural machine translation system learns to translate
from one language to another, does it learn the syntax or semantics of the
languages? Can this knowledge be extracted from the system to fill holes in
human scientific knowledge? Existing typological databases contain relatively
full feature specifications for only a few hundred languages. Exploiting the
existence of parallel texts in more than a thousand languages, we build a
massive many-to-one neural machine translation (NMT) system from 1017 languages
into English, and use this to predict information missing from typological
databases. Experiments show that the proposed method is able to infer not only
syntactic, but also phonological and phonetic inventory features, and improves
over a baseline that has access to information about the languages' geographic
and phylogenetic neighbors.
Comment: EMNLP 2017
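
A minimal sketch of the core idea, predicting typological features from NMT-derived language representations: the embeddings and feature values below are random placeholders, and logistic regression is an assumed stand-in for the predictor, not the paper's exact model.

    # Given language representations extracted from a many-to-one NMT system,
    # train one classifier per typological feature and predict the entries
    # missing from the database. All data here is synthetic.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Hypothetical NMT-derived language embeddings: one vector per language.
    lang_vecs = {f"lang{i}": rng.normal(size=32) for i in range(100)}

    # Hypothetical binary typological feature known for some languages only
    # (e.g., a WALS-style order feature: 0 = subject-verb, 1 = verb-subject).
    known = {f"lang{i}": int(rng.random() < 0.5) for i in range(60)}

    X = np.stack([lang_vecs[lang] for lang in known])
    y = np.array(list(known.values()))
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Fill in the feature for languages missing from the database.
    missing = [lang for lang in lang_vecs if lang not in known]
    preds = clf.predict(np.stack([lang_vecs[lang] for lang in missing]))
    for lang, p in list(zip(missing, preds))[:5]:
        print(lang, "->", "verb-subject" if p else "subject-verb")
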
Cascading Biases: Investigating the Effect of Heuristic Annotation Strategies on Data and Models
Cognitive psychologists have documented that humans use cognitive heuristics,
or mental shortcuts, to make quick decisions while expending less effort. We
hypothesize that when annotators perform annotation work on crowdsourcing
platforms, such heuristic use cascades into data quality and model
robustness. In this work, we study cognitive heuristic use in the context of
annotating multiple-choice reading comprehension datasets. We propose tracking
annotator heuristic traces, where we tangibly measure low-effort annotation
strategies that could indicate usage of various cognitive heuristics. We find
evidence that annotators might be using multiple such heuristics, based on
correlations with a battery of psychological tests. Importantly, heuristic use
among annotators determines data quality along several dimensions: (1) known
biased models, such as partial-input models, more easily solve examples
authored by annotators who rate highly on heuristic use, (2) models trained on
data from annotators scoring highly on heuristic use don't generalize as well, and (3)
heuristic-seeking annotators tend to create qualitatively less challenging
examples. Our findings suggest that tracking heuristic usage among annotators
can potentially help with collecting challenging datasets and diagnosing model
biases.
Comment: EMNLP 2022
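
As a toy illustration of what an annotator "heuristic trace" might look like in practice, the sketch below scores one low-effort signal, lexical overlap between an authored question and its source passage, and aggregates it per annotator. The metric, data, and names are hypothetical stand-ins, not the measures used in the paper.

    # A crude proxy for copy-heavy, low-effort question authoring.
    from collections import defaultdict

    def overlap(question: str, passage: str) -> float:
        # Naive whitespace tokenization; punctuation is not stripped.
        q = set(question.lower().split())
        p = set(passage.lower().split())
        return len(q & p) / max(len(q), 1)

    # Hypothetical annotation log: (annotator_id, question, passage).
    log = [
        ("ann1", "What did the crew find on the island?",
         "The crew found a wrecked ship on the island after the storm."),
        ("ann2", "Why was the expedition delayed?",
         "The crew found a wrecked ship on the island after the storm."),
    ]

    traces = defaultdict(list)
    for annotator, question, passage in log:
        traces[annotator].append(overlap(question, passage))

    # Per-annotator aggregate: higher mean overlap = stronger copy signal.
    for annotator, values in traces.items():
        print(annotator, "mean overlap:", round(sum(values) / len(values), 2))
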
Commonsense Knowledge Base Completion with Structural and Semantic Context
Automatic KB completion for commonsense knowledge graphs (e.g., ATOMIC and
ConceptNet) poses unique challenges compared to the much studied conventional
knowledge bases (e.g., Freebase). Commonsense knowledge graphs use free-form
text to represent nodes, resulting in orders of magnitude more nodes compared
to conventional KBs (18x more nodes in ATOMIC compared to Freebase
(FB15K-237)). Importantly, this implies significantly sparser graph structures
- a major challenge for existing KB completion methods, which assume densely
connected graphs over a relatively small set of nodes. In this paper, we
present novel KB completion models that can address these challenges by
exploiting the structural and semantic context of nodes. Specifically, we
investigate two key ideas: (1) learning from local graph structure, using graph
convolutional networks and automatic graph densification, and (2) transfer
learning from pre-trained language models to knowledge graphs for enhanced
contextual representation of knowledge. We describe our method to incorporate
information from both these sources in a joint model and provide the first
empirical results for KB completion on ATOMIC and evaluation with ranking
metrics on ConceptNet. Our results demonstrate the effectiveness of language
model representations in boosting link prediction performance and the
advantages of learning from local graph structure (+1.5 points in MRR for
ConceptNet) when training on subgraphs for computational efficiency. Further
analysis on model predictions shines light on the types of commonsense
knowledge that language models capture well.
Comment: AAAI 2020
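
To make the joint-model idea concrete, here is a toy sketch in which each node's representation concatenates a structural embedding (standing in for a GCN output over the graph) with a semantic embedding (standing in for a pre-trained LM encoding of the node's free-form text), and a DistMult-style bilinear scorer rates a candidate edge. All vectors are random placeholders; this is an assumption-laden illustration, not the paper's architecture.

    # Score a (head, relation, tail) triple from concatenated structural and
    # semantic node embeddings.
    import numpy as np

    rng = np.random.default_rng(0)
    D_STRUCT, D_SEM = 16, 32

    nodes = ["PersonX goes to the store", "to buy groceries"]
    struct = {n: rng.normal(size=D_STRUCT) for n in nodes}  # stand-in GCN output
    sem = {n: rng.normal(size=D_SEM) for n in nodes}        # stand-in LM encoding

    # One learned parameter vector per relation (diagonal bilinear / DistMult).
    rel = {"xIntent": rng.normal(size=D_STRUCT + D_SEM)}

    def score(head: str, relation: str, tail: str) -> float:
        h = np.concatenate([struct[head], sem[head]])
        t = np.concatenate([struct[tail], sem[tail]])
        return float(np.sum(h * rel[relation] * t))  # higher = more plausible

    print(score("PersonX goes to the store", "xIntent", "to buy groceries"))
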
QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations
Formulating selective information needs results in queries that implicitly
specify set operations, such as intersection, union, and difference. For
instance, one might search for "shorebirds that are not sandpipers" or
"science-fiction films shot in England". To study the ability of retrieval
systems to meet such information needs, we construct QUEST, a dataset of 3357
natural language queries with implicit set operations that map to sets of
entities corresponding to Wikipedia documents. The dataset challenges models to
match multiple constraints mentioned in queries with corresponding evidence in
documents and correctly perform various set operations. The dataset is
constructed semi-automatically using Wikipedia category names. Queries are
automatically composed from individual categories, then paraphrased and further
validated for naturalness and fluency by crowdworkers. Crowdworkers also assess
the relevance of entities based on their documents and highlight attribution of
query constraints to spans of document text. We analyze several modern
retrieval systems, finding that they often struggle on such queries. Queries
involving negation and conjunction are particularly challenging, and systems
are further challenged by combinations of these operations.
Comment: ACL 2023; Dataset available at https://github.com/google-research/language/tree/master/language/quest
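
The set semantics behind such queries are easy to make concrete. The sketch below composes made-up category contents with difference and intersection, mirroring the two example queries above, and adds a simple set-level F1 for scoring a retrieved answer set; the data and the metric choice are illustrative assumptions, not the paper's evaluation.

    # Toy category contents; all entries are made up for illustration.
    shorebirds = {"sanderling", "dunlin", "willet", "avocet"}
    sandpipers = {"sanderling", "dunlin", "willet"}
    scifi_films = {"Alien", "2001: A Space Odyssey", "Arrival"}
    films_shot_in_england = {"Alien", "2001: A Space Odyssey", "Notting Hill"}

    # "shorebirds that are not sandpipers" -> set difference
    print(shorebirds - sandpipers)              # {'avocet'}
    # "science-fiction films shot in England" -> set intersection
    print(scifi_films & films_shot_in_england)

    # Set-level F1 of a retrieved entity set against the gold set.
    def set_f1(pred: set, gold: set) -> float:
        if not pred or not gold:
            return 0.0
        p = len(pred & gold) / len(pred)
        r = len(pred & gold) / len(gold)
        return 2 * p * r / (p + r) if p + r else 0.0

    print(set_f1({"Alien"}, scifi_films & films_shot_in_england))  # ~0.67
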
ExpertQA: Expert-Curated Questions and Attributed Answers
As language models are adopted by a more sophisticated and diverse set of
users, it is critical across fields of study and professions that they provide
factually correct information supported by verifiable sources. This is
especially the case for high-stakes fields, such as
medicine and law, where the risk of propagating false information is high and
can lead to undesirable societal consequences. Previous work studying
factuality and attribution has not focused on analyzing these characteristics
of language model outputs in domain-specific scenarios. In this work, we
present an evaluation study analyzing various axes of factuality and
attribution in responses from several systems, bringing domain experts into
the loop. Specifically, we first collect expert-curated questions
from 484 participants across 32 fields of study, and then ask the same experts
to evaluate generated responses to their own questions. We also ask experts to
revise answers produced by language models, which leads to ExpertQA, a
high-quality long-form QA dataset with 2177 questions spanning 32 fields, along
with verified answers and attributions for claims in the answers.
Comment: Dataset & code are available at https://github.com/chaitanyamalaviya/expertqa
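
As a final toy illustration, the sketch below checks whether each claim in an answer is supported by its cited source, in the spirit of claim-level attribution evaluation. The lexical-overlap "support" test, its threshold, and the example claims are crude hypothetical stand-ins, not ExpertQA's expert judgments or any model from the paper.

    # Claim-level attribution check: is each claim backed by its cited source?
    claims_with_sources = [
        ("Metformin is a first-line treatment for type 2 diabetes.",
         "Guidelines recommend metformin as first-line treatment for type 2 diabetes."),
        ("Metformin cures type 1 diabetes.",
         "Guidelines recommend metformin as first-line treatment for type 2 diabetes."),
    ]

    def supported(claim: str, source: str, threshold: float = 0.7) -> bool:
        # Fraction of claim tokens that also appear in the source.
        c = set(claim.lower().rstrip(".").split())
        s = set(source.lower().rstrip(".").split())
        return len(c & s) / max(len(c), 1) >= threshold

    for claim, source in claims_with_sources:
        label = "supported" if supported(claim, source) else "unsupported"
        print(label, "|", claim)
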